1. Introduction

The fact that the maximum likelihood estimate in a logistic regression model
may not exist is a well-known phenomenon and a number of recent papers
have explored its underlying geometrical basis. [9], [12] and [7] point out that
existence, and non-existence, of the estimate can be fully characterised by con- 
sidering the closure of the model as an exponential family. In this formulation 
it becomes clear that the maximum is always well-defined, but can lie on the 
boundary rather than in the relative interior. Furthermore, the boundary can 
be considered as a polytope characterised by a finite number of extremal points. 
This paper builds on this work and shows that the boundary affects more 
than the existence of the maximum likelihood estimate. In particular, even when
the estimate exists, the geometry and boundary can strongly affect inference
procedures. First and higher order asymptotic results cannot be uniformly applied.
Indeed, near the boundary, effects such as high skewness, discreteness and
collinearity dominate, any of which could render inference based on asymptotic 
normality suspect. The paper presents a simple diagnostic tool which allows 
the analyst to check if the boundary is going to have an appreciable effect on 
standard inferential techniques. The tool, and the effect that the boundary can 
have, are illustrated in a well-known example and through simulated datasets.

Example 1. The Fisher iris data set, [8], is often used to illustrate classification 
and binary regression. Even in this familiar case we show that the boundary is 
close enough to have significant effects for inference. Let us focus on the problem 
of distinguishing the species Iris setosa - coded with 1 in figures - from Iris
versicolor (coded with 0) based on the length of the flower's sepal. The left hand 
panel in Fig. 1 shows the logistic regression fit, while the right hand panel shows a 
contour plot of the log-likelihood for intercept parameter $\alpha$ and slope parameter
$\beta$. The near singularity of the observed Fisher information is evident. While
this does have a geometric interpretation, this particular collinearity effect can 
easily be removed by centring the explanatory variable around its sample mean. 
For clarity of exposition we work exclusively with the centred model from now 
on. 



Fig 1. Fisher Iris data example: fit (left) and log-likelihood contours (right)



Figure 2, to be discussed in Section 3, shows various aspects of sampling dis- 
tributions under the maximum likelihood fit. These show failure of first, and 
higher, order asymptotics due to skewness and discreteness in a number of different
ways. As explained in the sequel, these effects stem from the closeness of the boundary,
shown in panel (a) as points connected with solid line segments.
In this example, the boundary is just close enough to start to play a significant 
role. More extreme examples are shown in Section 3. 







Fig 2. Fisher Iris data example: effects of the boundary on the sampling distribution 



2. Overview of geometry 

This section looks at the geometry underlying logistic regression. Section 2.1 
describes the more general case of the geometry of extended multinomial models, 
while Section 2.2 focuses on logistic regression. Section 2.3 defines the diagnostic
which allows the analyst to check if the boundary is close enough to substantially
affect first order asymptotic results. Finally, in Section 3, we return to Example
1 and related variants. 



2.1. Extended multinomial models 

The information geometry of the extended multinomial model is considered in 
[1]. The cell probabilities of the extended multinomial define the simplex 

$$\Delta^k := \left\{ \pi = (\pi_0, \pi_1, \dots, \pi_k)^T : \pi_i \geq 0, \ \sum_{i=0}^{k} \pi_i = 1 \right\}. \qquad (2.1)$$

The term 'extended' refers to the fact that this is the closure of the multinomial 
model: zero cell probabilities are allowed. The closure of exponential families 
has been studied by [2], [4], [11] and [5]. Of central interest in this paper is the 
consequence of working in a closed extended family, rather than in the more 
familiar open parameter space. 

One important feature - with obvious implications for first order asymptotics 
- is that the Fisher information can become arbitrarily close to singular as 
you approach the boundary. This is clearly shown by considering its spectral 
decomposition. With $\pi_{(0)}$ denoting the vector of all probabilities except $\pi_0$, the
Fisher information matrix for the natural (log-odds) parameters, written as a
function of $\pi$, is the sample size times

$$I(\pi) := \mathrm{diag}(\pi_{(0)}) - \pi_{(0)} \pi_{(0)}^T.$$

Its explicit spectral decomposition is an example of interlacing eigenvalue results
(see, for example, [10], Chapter 4). In particular, suppose $\{\pi_i\}_{i=1}^{k}$ comprises
$g > 1$ distinct values $\lambda_1 > \cdots > \lambda_g > 0$, $\lambda_j$ occurring $m_j$ times, so that
$\sum_{j=1}^{g} m_j = k$. Then, the spectrum of $I(\pi)$ comprises $g$ simple eigenvalues
$\{\tilde{\lambda}_j\}_{j=1}^{g}$ satisfying

$$\lambda_1 > \tilde{\lambda}_1 > \cdots > \lambda_g > \tilde{\lambda}_g \geq 0, \qquad (2.2)$$

together, if $g < k$, with $\{\lambda_j : m_j > 1\}$, each such $\lambda_j$ having multiplicity $m_j - 1$.
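
The interlacing pattern in (2.2) can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes all of $\pi_1, \dots, \pi_k$ are positive and uses the standard fact that the simple eigenvalues of $\mathrm{diag}(p) - pp^T$ are the roots of the secular function $1 - \sum_j m_j \lambda_j^2 / (\lambda_j - \lambda)$. The function name `simple_eigenvalues` and the bisection scheme are illustrative choices.

```python
def simple_eigenvalues(probs, iterations=200):
    """Simple eigenvalues of diag(p) - p p^T, found as roots of the secular function.

    probs holds pi_1..pi_k (pi_0 excluded); entries are assumed positive, sum <= 1.
    """
    distinct = sorted(set(probs), reverse=True)    # lambda_1 > ... > lambda_g
    mult = [probs.count(v) for v in distinct]      # m_1, ..., m_g

    def secular(lam):
        # det(diag(p) - p p^T - lam I) = prod(p_i - lam) * secular(lam)
        return 1.0 - sum(m * v * v / (v - lam) for v, m in zip(distinct, mult))

    roots = []
    # One root lies between consecutive distinct values, and one in [0, lambda_g).
    for upper, lower in zip(distinct, distinct[1:] + [0.0]):
        lo, hi = lower, upper
        for _ in range(iterations):
            mid = 0.5 * (lo + hi)
            if mid <= lo or mid >= hi:             # interval exhausted in floating point
                break
            if secular(mid) > 0.0:                 # secular is decreasing on the interval
                lo = mid
            else:
                hi = mid
        roots.append(0.5 * (lo + hi))
    return distinct, mult, roots


# Interior point: pi_0 = 0.1, one repeated value (0.2 occurs twice).
distinct, mult, roots = simple_eigenvalues([0.4, 0.2, 0.2, 0.1])
# Near the boundary: pi_0 = 0.001, so the smallest eigenvalue is close to zero.
near_boundary = simple_eigenvalues([0.5, 0.3, 0.199])[2]
```

In the first call the repeated value 0.2 contributes one further eigenvalue equal to 0.2, in line with the multiplicity statement above. In the second call the smallest root shrinks with $\pi_0$, which is the near-singularity described in the text.
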
While the closure of the full multinomial model is easy to understand in
the representation (2.1), the closures of lower dimensional sub-models of $\Delta^k$
expressed in the natural parameters, such as logistic regression models, are more
problematic to compute - although they can be critical inferentially. In order
to visualise the geometry of the problem of computing limits in exponential 
families, consider a low-dimensional example. 

Example 2. Define a two dimensional full exponential subfamily $\pi(\alpha, \beta)$ of $\Delta^3$
where

$$\pi(\alpha, \beta) \propto \left( \exp\{\alpha v_{11} + \beta v_{21}\}, \exp\{\alpha v_{12} + \beta v_{22}\}, \exp\{\alpha v_{13} + \beta v_{23}\}, \exp\{\alpha v_{14} + \beta v_{24}\} \right)$$

for vectors $v_1 = (1, 2, 3, 4)$, $v_2 = (1, 4, 9, -1)$. Consider directions from the origin
$(\alpha, \beta) = (0, 0)$ found by writing $\alpha = \theta\beta$ giving, for each $\theta$, a one dimensional full
exponential family parameterized by $\beta$ in the direction $(\theta + 1, 2\theta + 4, 3\theta + 9, 4\theta - 1)$.
The aspect of this vector which determines the connection to the boundary is the
rank order of its elements, in particular which elements are the maximum and
minimum. For example, if the first component is the maximum and the
last the minimum, then as $\beta \to \pm\infty$ this one dimensional family will be connected
to the first and fourth vertex of the embedding four simplex, respectively.
In order to see all possible rank orderings of the components, see the right panel
of Fig. 3 which shows the graphs of the functions $\{\theta + 1, 2\theta + 4, 3\theta + 9, 4\theta - 1\}$.
The maximum and minimum ranks are determined by the upper and lower envelopes,
shown as solid lines. From this analysis of the envelopes of a set of
linear functions, it can be seen that the function $2\theta + 4$ is redundant. It can be
shown that only three of the four vertices of the ambient simplex have been
connected by the model. This is shown explicitly in the left panel of Fig. 3.
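
The envelope argument of Example 2 can be reproduced with a few lines of code. This is a minimal sketch under stated assumptions: the helper name `envelope_indices` is hypothetical, inputs are taken to be integers (or `Fraction` values) so that the arithmetic is exact, and the lines are probed once inside every interval between consecutive crossing points.

```python
from fractions import Fraction
from itertools import combinations


def envelope_indices(v1, v2):
    """Indices i whose line v1[i]*theta + v2[i] attains the maximum (upper envelope)
    or the minimum (lower envelope) for some value of theta."""
    lines = list(zip(v1, v2))
    # theta values at which two non-parallel lines cross
    crossings = sorted({Fraction(b2 - b1, a1 - a2)
                        for (a1, b1), (a2, b2) in combinations(lines, 2)
                        if a1 != a2})
    if crossings:
        # one probe left of all crossings, one in each gap, one right of all crossings
        probes = ([crossings[0] - 1]
                  + [(left + right) / 2 for left, right in zip(crossings, crossings[1:])]
                  + [crossings[-1] + 1])
    else:
        probes = [Fraction(0)]
    upper, lower = set(), set()
    for theta in probes:
        values = [a * theta + b for a, b in lines]
        upper.add(values.index(max(values)))
        lower.add(values.index(min(values)))
    return upper, lower


# Example 2: v1 = (1, 2, 3, 4), v2 = (1, 4, 9, -1), indices counted from 0.
upper, lower = envelope_indices([1, 2, 3, 4], [1, 4, 9, -1])
reached = upper | lower
```

For the vectors of Example 2 the index 1, corresponding to $2\theta + 4$, appears on neither envelope, so the second vertex is never reached, in agreement with the text.
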





Fig 3. Attaching a two dimensional example to the boundary of the simplex. 



In general, the problem of finding the limit points in full exponential families 
inside simplex models is a problem of finding redundant linear constraints. As 
shown in [6], this can be converted, via duality, into the problem of finding 
extremal points in a finite dimensional affine space. For an alternative approach, 
see [9]. 

2.2. Logistic regression 

A logistic regression model is a full exponential family that lies in a very high 
dimensional simplex when considered as a model for the joint distribution of N 
binary response variates. Consider an $N \times D$ design matrix $X$ whose $i$th row, $x_i^T$,
contains the covariate values for the $i$th case and a binary response $t \in \{0, 1\}^N$.
Let $s(a) = \log\left(\frac{a}{1-a}\right)$ so that $s^{-1}(a) = \frac{1}{1 + \exp(-a)}$, the logistic regression model
being given by

$$P(T_i = 1) = s^{-1}(x_i^T \beta).$$

This is a full exponential family that lies in the $(2^N - 1)$-dimensional simplex
when considered as a model for the joint distribution of the $N$ binary response
variates. A design matrix $X$ defines a $D$-dimensional subset - an affine subspace
in the natural parameters - and changing the explanatory variates changes the 
orientation of this low dimensional space inside the space of all joint distribu- 
tions. 

Example 3. Consider a logistic regression with 20 cases in which X comprises 
two columns, $1_{20}$, the vector of all ones, and $(1, 2, \dots, 20)^T$. It is important to
consider the way that this two-dimensional exponential family is attached to the
boundary. The generalisation of Fig. 3 is shown in Fig. 4, where here only lines
which are part of the envelope are plotted. The corresponding vertices which
the full exponential family reaches are given by vectors of the form $z$ with the
structure either $z_i = 0$ for $i = 1, \dots, h$ and $1$ for $i = h + 1, \dots, 20$ or $z_i = 1$ for
$i = 1, \dots, h$ and $0$ for $i = h + 1, \dots, 20$. This generalises at once to any single
covariate taking distinct values. 
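
The vertices described in Example 3 are easy to enumerate, together with their images in the space of sufficient statistics. The sketch below is illustrative only: the function names are hypothetical, and letting $h$ run from 0 to 20, so that the two constant response vectors are included, is an assumption rather than something stated in the text.

```python
def threshold_vertices(n):
    """All 0/1 vectors of length n that switch value at most once along the ordering."""
    vertices = set()
    for h in range(n + 1):
        vertices.add(tuple([0] * h + [1] * (n - h)))   # zeros first, then ones
        vertices.add(tuple([1] * h + [0] * (n - h)))   # ones first, then zeros
    return sorted(vertices)


def sufficient_statistic(z, covariate):
    """Image X^T z of a response vector z for the design X = [1, covariate]."""
    return (sum(z), sum(zi * xi for zi, xi in zip(z, covariate)))


covariate = list(range(1, 21))                         # the column (1, 2, ..., 20)
vertices = threshold_vertices(20)
images = sorted({sufficient_statistic(z, covariate) for z in vertices})
```

Each image is a pair (number of ones, sum of the covariate over the ones); joining images with consecutive counts traces out the boundary polygon in the sufficient-statistic plane.
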




Fig 4. Envelope of lines 



2.3. A diagnostic tool 

First order asymptotics essentially assumes that the parameter space can be 
treated as a Euclidean space with a fixed metric, typically the Fisher information 
evaluated either at a hypothesised value or the maximum likelihood estimate. 
This observation allows a simple diagnostic tool to be developed which gives 
a sufficient condition for first order methods to be appropriate. If they were 
appropriate, then the closest point on the boundary should be a large distance, 
as measured in this fixed metric, from the maximum likelihood estimate - this 
length being calibrated using the quantiles of the relevant $\chi^2$ distribution.

When the dimension of the extended multinomial model is high, relative to 
sample size, first order asymptotic approximations hold at best on low dimen- 
sional subspaces. In the case of logistic regression, consider the mean parameter 
space which, with a small but common abuse of notation, we can also consider 
as the space of sufficient statistics. In this space, the vertices which define the 
boundary polytope are also the extremal points in the convex set of attainable 
values of the sufficient statistics, see Figs 2 (a) and 5. In Fig. 5, whose details 
are discussed in Section 3, the contours are defined by the squared distance 
from the maximum likelihood estimate relative to the Fisher information there.
The largest contour is calibrated by the 99%-quantile of the $\chi^2_2$ distribution.
The diagnostic is simply based on checking if this contour crosses the boundary 
or not. If it does cross the boundary, for example marginally in the left panel 
and strongly in the right, then we regard first order asymptotic normality as 
suspect. 
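
A rough version of this check for a single covariate with intercept might look as follows. It is a sketch under several assumptions that go beyond the text: the function name `boundary_diagnostic` is hypothetical, the squared distance in the sufficient-statistic plane is taken to be $(s - \mu)^T \Sigma^{-1} (s - \mu)$ with $\Sigma = X^T W X$ the covariance of the sufficient statistics at the fitted parameters, and the boundary is the polygon through the images of the threshold vectors.

```python
import math


def boundary_diagnostic(x, alpha, beta, level=0.99):
    """Smallest squared distance from the fitted mean to the boundary polygon,
    the chi-squared (2 d.f.) cut-off, and whether the contour crosses."""
    n = len(x)
    p = [1.0 / (1.0 + math.exp(-(alpha + beta * xi))) for xi in x]
    mu = (sum(p), sum(xi * pi for xi, pi in zip(x, p)))      # mean of (sum t, sum x t)
    w = [pi * (1.0 - pi) for pi in p]
    i11 = sum(w)
    i12 = sum(wi * xi for wi, xi in zip(w, x))
    i22 = sum(wi * xi * xi for wi, xi in zip(w, x))
    det = i11 * i22 - i12 * i12
    m11, m12, m22 = i22 / det, -i12 / det, i11 / det         # inverse covariance

    def sq(dx, dy):
        return m11 * dx * dx + 2.0 * m12 * dx * dy + m22 * dy * dy

    xs = sorted(x)
    lower, upper = [(0, 0.0)], [(0, 0.0)]
    for h in range(1, n + 1):
        lower.append((h, lower[-1][1] + xs[h - 1]))          # ones on the h smallest x
        upper.append((h, upper[-1][1] + xs[n - h]))          # ones on the h largest x
    edges = list(zip(lower, lower[1:])) + list(zip(upper, upper[1:]))

    best = math.inf
    for u, v in edges:
        dx, dy = v[0] - u[0], v[1] - u[1]
        wx, wy = mu[0] - u[0], mu[1] - u[1]
        # projection onto the segment in the chosen metric, clamped to its ends
        t = (m11 * wx * dx + m12 * (wx * dy + wy * dx) + m22 * wy * dy) / sq(dx, dy)
        t = min(1.0, max(0.0, t))
        best = min(best, sq(wx - t * dx, wy - t * dy))

    cutoff = -2.0 * math.log(1.0 - level)                    # chi-squared quantile, 2 d.f.
    return best, cutoff, best < cutoff


centred = [i - 10.5 for i in range(1, 21)]                   # Example 3 covariate, centred
flat = boundary_diagnostic(centred, 0.0, 0.0)
steep = boundary_diagnostic(centred, 0.0, 1.5)
```

With the hypothetical parameter values above, the contour stays inside the polygon for the flat fit and crosses it for the steep one. The closed-form quantile relies on the $\chi^2_2$ distribution function being $1 - \exp(-x/2)$.
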

We note in passing that, in a general multinomial model setting, distances to 
the boundary can be easily computed, via quadratic programming, and in fact 
have a closed form. Let $Q_{\pi_0}(\pi)$ be the squared distance from $\pi_0$ to $\pi$ measured
with respect to the Fisher information at $\pi_0$. Using this distance function, the
squared distance to the face defined by the index set $I$ from the point $\pi$ is

$$Q_{\pi_0}(\pi) = \frac{\pi_I}{1 - \pi_I},$$

where $\pi_I = \sum_{i \in I} \pi_{0i}$.
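
The closed form can be sanity-checked numerically. The sketch below rests on two assumptions not spelled out in the text: the squared distance is taken to be $\sum_i (\pi_i - \pi_{0i})^2 / \pi_{0i}$, and the face indexed by $I$ is read as the set of distributions whose cells in $I$ have probability zero. Under that reading the closest point sets those cells to zero and rescales the remaining cells proportionally. The function names are hypothetical.

```python
def squared_distance(pi0, pi):
    """Squared distance from pi0 to pi in the metric assumed above."""
    return sum((p - q) ** 2 / q for p, q in zip(pi, pi0))


def closest_point_on_face(pi0, index_set):
    """Closest point with zero mass on index_set, plus the closed-form distance."""
    mass = sum(pi0[i] for i in index_set)                    # pi_I
    point = [0.0 if i in index_set else q / (1.0 - mass)
             for i, q in enumerate(pi0)]
    return point, mass / (1.0 - mass)


pi0 = [0.1, 0.2, 0.3, 0.4]
point, closed_form = closest_point_on_face(pi0, {0, 1})
direct = squared_distance(pi0, point)                        # should agree with closed_form

# Any other point on the same face should be at least as far away.
other = [0.0, 0.0, 0.5, 0.5]
other_distance = squared_distance(pi0, other)
```

Here $\pi_I = 0.3$, so both computations give $0.3 / 0.7$, and the alternative point on the face is further away.
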

3. Examples and discussion 

Let us return to Example 1. Consider first panel (a) of Fig. 2. This is a plot of 
randomly sampled sufficient statistics - plotted by dots - generated from the 
distribution identified by the maximum likelihood estimate. The vertices in the 
extended multinomial model to which the logistic model is connected correspond 
to a set of points in the sample space - plotted with circles - and the edges which 
connect these points define a one dimensional boundary - plotted with straight 
lines. The image of the boundary is a polytope and it is clear that for the iris 
data this sample is getting very close to the boundary. 

In the example the boundary is just close enough for the first order approxi- 
mations to start to break down. This is happening in a number of ways. There is 
noticeable discreteness in the sample, each vertical streak corresponding to one 
of the extremal points on the boundary. This effect becomes more pronounced 
the closer the boundary becomes, as shown below. 

The second way that the first order approximation breaks down is that the
effects of higher order terms in the asymptotic expansions are starting to make
themselves felt. This is illustrated by the solid contour lines, which are defined by
the two dimensional Edgeworth expansion of the sampling distribution, see [3].
It can be seen that these are not centred ellipses, as would have been expected 
if the normal approximation was adequate. Rather a distortion caused by the 
boundary is becoming evident. 

Panel (b) in Fig. 2 shows the sampling distribution of the maximum likelihood 
estimate. Near the boundary the very high degree of non-linearity between the 
maximum likelihood estimates and the natural sufficient statistics has a very 
strong effect, as can be seen. The strong directional features correspond to 
directions of recession as described in [9]. The effect of this non-linearity can be
further seen in panel (c) which plots the marginal distribution of $\hat{\beta}$, the estimate
of the slope parameter, whose large skewness clearly indicates non-normality. 
In order to show how all these aspects become stronger when the boundary
gets closer, consider Fig. 5. The left hand panel shows the same information as
in Fig. 2, but now with the contours of our proposed diagnostic plotted. This is
the case where the maximum likelihood estimate from Example 1 is used and,
as can be seen, the diagnostic line, calibrated by the 99% quantile of the $\chi^2_2$
distribution, just touches the boundary. This implies that boundary effects are 
starting to affect inference, as described above. 





Fig 5. Distribution of sufficient statistics near the boundary

In the right hand panel of Fig. 5, the sampling has been done from a dis- 
tribution much closer to the boundary. As can be clearly seen, the diagnostic 
curves strongly cut the boundary. The effects on inference in this case are much 
stronger. The discretisation effects are much stronger; in particular, note that 
most of the probability mass on the boundary lies on a relatively small number 
of vertices - plotted as circles in the figure. The effect on the sampling distribu- 
tion of the maximum likelihood estimate is even more extreme. There is now an 
appreciably large probability that a sampled vector of sufficient statistics lies 
on a boundary point, which implies an 'infinite' slope estimate. This gives very 
strong departures from normality for both the joint and marginal distributions. 